Current Issue : January-March Volume : 2022 Issue Number : 1 Articles : 5 Articles
This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods based on how they handle inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not significantly model the temporal dimension in either unimodal or multimodal interaction; (ii) pseudo-temporal architectures (PTA), which also oversimplify the temporal dimension, but in only one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies. In addition, we review the basic feature representation methods for each modality and present aggregated evaluation results for the reported methodologies. Finally, we conclude with an in-depth analysis of the future challenges related to validation procedures, representation learning and method robustness....
Using fake audio to spoof audio devices in the Internet of Things has become an important problem in modern network security. To address the lack of robust features in fake audio detection, this paper proposes a hidden feature extraction method for audio streams based on a heuristic mask for empirical mode decomposition (HM-EMD). First, using HM-EMD, each signal is decomposed into several monotonic intrinsic mode functions (IMFs). Then, on the basis of the IMFs, basic features and hidden information features (HCFs) of the audio streams are constructed, respectively. Finally, a machine learning method is used to classify the audio streams based on these features. The experimental results show that the hidden information features based on HM-EMD can effectively supplement the nonlinear and nonstationary information that traditional features such as mel cepstrum features cannot express, and can better represent hidden acoustic events, which provides a new research idea for fake audio detection....
In today’s society, the continuous deepening of international cultural integration forms the backdrop of the times. China has become more and more closely connected with the world, and many physical and online news media have become platforms for China to receive information from the world and spread Chinese culture. Business English translation is therefore valued by translation researchers and translators. To address the shortcomings of current business English translation research, this paper designs and develops a business English translation architecture based on artificial intelligence speech recognition and edge computing. First, considering the relevance and complementarity between the speech and text modalities, this paper uses a deep neural network feature fusion method to fuse the extracted monomodal features and perform speech recognition. Second, an edge computing method is adopted to establish the business English translation system architecture. Finally, simulation tests verify the efficiency of the business English translation framework established in this paper. Compared with existing methods, our proposal improves accuracy by at least 10%, and model-building time also decreases noticeably. The purpose of this research is to discuss how to deal with the many differences between the source language and the target language, and how to enhance the readability of the translation and meet the reader’s cultural cognition and needs....
With the support of big data and information technology, sectors such as sports, health, and the medical industry can integrate and readjust their existing resources, which improves the operating efficiency of the industry and taps its huge potential. With advances in big data analysis, voice features, and the Internet of Things (IoT), personalized health management is becoming the development trend and breakthrough of the sports and health industry. The application of big data will tap the huge potential of the sports and health industry. In this paper, we use the Mel-frequency cepstrum coefficient as the speech feature processing method. When the linear frequency is transformed to the Mel frequency by Fourier transform, the calculation accuracy decreases as the frequency increases, and the low-frequency signal is retained to improve the anti-noise ability. With further study of the voice feature processing and the IoT model of big data sports and health management, a vector addition regression was developed to compare the two real scoring features of the processing results, which paves the way for further analysis and result evaluation. Experimental verification shows that the method in this paper can better learn the speech features. At the same time, with the introduction of noise reduction, the big data of speech recognition in sports health management becomes more robust, which improves the overall system performance....
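The loss of resolution at higher frequencies mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the widely used formula mel = 2595 · log10(1 + f / 700) for the Hz-to-Mel mapping; the abstract does not state which variant the authors used, and the function names (`hz_to_mel`, `mel_to_hz`, `mel_band_edges`) and the 0 to 8000 Hz range are hypothetical choices made for illustration only.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Map a linear frequency in Hz to the Mel scale (assumed common formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> List[float]:
    """Return n_bands + 2 edge frequencies (Hz), equally spaced on the Mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 2)]


if __name__ == "__main__":
    # Illustrative range only; not taken from the paper.
    edges = mel_band_edges(0.0, 8000.0, 10)
    for lo, hi in zip(edges, edges[1:]):
        # Bands widen in Hz as frequency grows: fine detail is kept at low
        # frequencies and coarser detail at high frequencies.
        print(f"{lo:8.1f} Hz -> {hi:8.1f} Hz  (width {hi - lo:7.1f} Hz)")
```

Because the band edges are equally spaced in Mel but not in Hz, each successive band covers a wider range of linear frequencies, which is one way to read the statement that low-frequency content is retained with higher precision.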
Speech emotion recognition (SER) is an important research topic. Extracting image features such as spectrograms is one of the common ways of obtaining information from speech. In the area of image recognition, a relatively novel type of network called the capsule network has shown promising results. This study aims to use capsule networks to encode spatial information from spectrograms and to analyse their performance when paired with different loss functions. Experiments comparing the capsule network with models from previous works show that the capsule network outperforms them....
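For readers unfamiliar with capsule networks, the following is a minimal sketch of the "squash" nonlinearity from the standard capsule network formulation, which rescales a capsule's output vector so that its length lies between 0 and 1 while keeping its direction. The abstract does not describe the exact architecture or loss functions used in the study, so this is an illustration of the general technique, not the authors' implementation; the names `squash` and `length` and the `eps` parameter are hypothetical.

```python
import math
from typing import List


def length(v: List[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def squash(s: List[float], eps: float = 1e-9) -> List[float]:
    """Squash nonlinearity: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).

    Short vectors shrink toward zero length and long vectors approach
    unit length; eps (an assumption of this sketch) avoids division by zero.
    """
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]


if __name__ == "__main__":
    for vec in ([0.01, 0.0], [3.0, 4.0], [30.0, 40.0]):
        out = squash(vec)
        print(vec, "->", [round(x, 4) for x in out], "length", round(length(out), 4))
```

In this formulation the length of the squashed vector can be read as the probability that the entity a capsule represents is present, which is why capsule losses are typically defined on vector lengths.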